We have a dataset about red wine quality. We are going to look into this dataset's variables, try to find similarities between wines according to their quality, and build a predictive model of wine quality. I used some techniques that were not covered during the certificate; I learnt them in college earlier in my life and thought it would be nice to consolidate everything here!
The table has 1599 observations, each one a different red wine, described by 12 variables.
Data
For each wine, we have its characteristics (its acidity, its pH, its alcohol content…) as well as its quality. The quality variable, qualiteY, differs from the others: it takes the value 1 if the wine is good quality, 0 otherwise.
This variable lets us build an explanatory model from the first 11 variables, the explanatory variables.
Missing values
There are no missing values in this dataset.
Proportion of the target variable
The two classes of qualiteY are balanced.
### Variables Distribution
Most of the variables are approximately normally distributed, except for residual.sugar and chlorides, which are skewed: most values are concentrated on the left, with a long right tail.
After studying the density of the variables, I want to study the links between the explanatory variables. I am trying to find out whether some variables are correlated, that is, whether there is a link between two or more variables such that their values tend to move together.
Let’s look in more depth at the links between these variables to understand the nature of the correlation.
We can see that there is a linear relation between these variables. Let’s now look at the links between these explanatory variables and the variable to explain, qualiteY.
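The correlations come from the correlation matrix; as a minimal sketch of the underlying computation (in Python rather than the report's R, on made-up values), Pearson's coefficient between two columns can be computed as:

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

# Hypothetical columns: fixed acidity and citric acid tend to rise together.
fixed_acidity = [7.4, 7.8, 11.2, 7.9, 8.5, 10.1]
citric_acid = [0.00, 0.04, 0.56, 0.06, 0.28, 0.43]
r = pearson(fixed_acidity, citric_acid)  # close to 1 -> strong positive link
```

A coefficient near +1 or -1 indicates a near-linear relation, as seen in the matrix.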
We are trying to understand whether these variables influence wine quality.
We plotted the distribution of the 11 explanatory variables depending on the quality of the wine. Comparing the distributions of each variable for qualiteY = 0 and qualiteY = 1, we can see visually that the distributions differ markedly for 4 explanatory variables: volatile.acidity, citric.acid, sulphates and alcohol.
Thus, we are showing only these 4 variables below:
It looks like these variables have an influence on the quality of a wine.
To prove this intuition statistically, we performed a test of means for each explanatory variable between qualiteY = 0 and qualiteY = 1. The null hypothesis \(H_0\) of the means test is that the two means are equal.
Let’s start, for instance, with volatile.acidity: the test lets us check whether the mean of volatile.acidity for wines of quality 0 is equal to the mean of volatile.acidity for wines of quality 1.
We fix the significance threshold at 5%. The null hypothesis will be rejected if the p-value is below 5%.
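As a sketch of the mechanics (stdlib Python on hypothetical values; the report itself runs the test in R), the Welch t statistic compares the two group means, and for large samples |t| > 1.96 roughly corresponds to a p-value below 5%:

```python
import math

def welch_t(sample0, sample1):
    """Welch two-sample t statistic (unequal variances)."""
    n0, n1 = len(sample0), len(sample1)
    m0, m1 = sum(sample0) / n0, sum(sample1) / n1
    v0 = sum((x - m0) ** 2 for x in sample0) / (n0 - 1)
    v1 = sum((x - m1) ** 2 for x in sample1) / (n1 - 1)
    return (m0 - m1) / math.sqrt(v0 / n0 + v1 / n1)

# Hypothetical volatile.acidity values for bad (0) and good (1) wines.
bad = [0.70, 0.88, 0.76, 0.66, 0.60, 0.65, 0.58, 0.71]
good = [0.35, 0.28, 0.40, 0.32, 0.38, 0.30, 0.42, 0.36]
t = welch_t(bad, good)
# For large samples, |t| > 1.96 roughly corresponds to p < 5%.
```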
By applying the means test on these 4 variables, here is what we conclude:
The good quality wines have:
a level of volatile acidity lower than bad wines, because the p-value is 1.7 × 10^{-39} < 5%.
a level of citric acid higher than bad wines, because the p-value is 10^{-10} < 5%.
a level of sulphates higher than bad wines, because the p-value is 10^{-18} < 5%.
a level of alcohol higher than bad wines, because the p-value is 10^{-77} < 5%.
To go further in the analysis, we are going to perform a principal component analysis (PCA), which will allow us to see whether we can identify groups of individuals and variables by reducing the number of dimensions.
Principal component analysis makes it possible to analyse and visualise a dataset of individuals described by multiple quantitative variables. It is therefore possible to study the similarities between individuals across all variables and to understand individual profiles in a lower-dimensional space.
I perform a PCA on this dataset to see whether there is a combination of these 11 explanatory variables that can explain wine quality.
The variance represents the information within a dataset. The idea is to reduce the number of dimensions without losing too much information. We choose to keep 70% of the information from the dataset, which reduces the number of dimensions from 11 to 4.
## percentage of variance cumulative percentage of variance
## comp 1 28.1739313 28.17393
## comp 2 17.5082699 45.68220
## comp 3 14.0958499 59.77805
## comp 4 11.0293866 70.80744
## comp 5 8.7208370 79.52827
## comp 6 5.9964388 85.52471
## comp 7 5.3071929 90.83191
## comp 8 3.8450609 94.67697
## comp 9 3.1331102 97.81008
## comp 10 1.6484833 99.45856
## comp 11 0.5414392 100.00000
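From the explained-variance percentages above, the number of components needed to reach a target share of information can be computed directly; a small Python sketch (values copied from the table):

```python
# Percentage of variance per principal component, copied from the PCA output.
explained = [28.1739313, 17.5082699, 14.0958499, 11.0293866, 8.7208370,
             5.9964388, 5.3071929, 3.8450609, 3.1331102, 1.6484833, 0.5414392]

def n_components_for(explained_pct, target_pct):
    """Smallest number of leading components whose cumulative variance reaches the target."""
    total = 0.0
    for k, pct in enumerate(explained_pct, start=1):
        total += pct
        if total >= target_pct:
            return k
    return len(explained_pct)

k = n_components_for(explained, 70.0)  # 4 components reach ~70.8%
```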
There is no sharp break in the scree plot, except between the first and the second dimension. We will keep the first 4 axes for the analysis.
### Variable analysis
With a PCA, each axis is a linear combination of the variables.
The variance explained by the first two axes is about 46%.
Axis 1: this axis represents wine acidity. It opposes two strongly correlated variables (citric.acid and fixed.acidity) to the pH. An acidic wine will have a low pH and a high fixed.acidity value. We already saw in the correlation matrix that these variables were negatively correlated.
Axis 2: this axis represents the sulfur dioxide in the wine (free.sulfur.dioxide and total.sulfur.dioxide, positively correlated). These variables are negatively correlated with alcohol.
No variable is well represented on axis 3, but we can still see a negative correlation between alcohol and volatile.acidity. Axis 4 is driven by chlorides and sulphates, which are positively correlated.
### Individual analysis
Concentration ellipse
Good quality wines tend to have a lower sulfur dioxide level than bad quality wines. However, acidity doesn’t seem to affect wine quality.
Good quality wines have a higher alcohol percentage and a lower volatile acidity than lower quality wines. Some individuals stand out: 152, 1436 and 1477.
Since these points are well represented on the axes, let’s take a closer look at them to understand why they stand out.
We can see that residual.sugar (except for wine 152) and free.sulfur.dioxide are high for these 3 individuals, much higher than the median.
The variable qualiteY is a binary variable giving the wine quality: 0 if the wine is bad quality and 1 if it is good quality. We are going to use this variable in a logistic regression to build a model able to explain a wine’s quality from its characteristics.
We are trying to find out which variables best explain wine quality.
The step function selects a model with a stepwise procedure based on minimising the AIC criterion. It lets me keep only the variables relevant to the model and drop those that contribute nothing or only add noise.
The model keeps these variables: fixed.acidity, volatile.acidity, citric.acid, chlorides, free.sulfur.dioxide, total.sulfur.dioxide, sulphates, alcohol.
##
## Call:
## glm(formula = qualiteY ~ fixed.acidity + volatile.acidity + citric.acid +
## chlorides + free.sulfur.dioxide + total.sulfur.dioxide +
## sulphates + alcohol, family = "binomial", data = vin)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -3.3340 -0.8488 0.3242 0.8294 2.3493
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -9.216919 0.949966 -9.702 < 2e-16 ***
## fixed.acidity 0.127271 0.051081 2.492 0.01272 *
## volatile.acidity -3.379881 0.477983 -7.071 1.54e-12 ***
## citric.acid -1.260357 0.560972 -2.247 0.02466 *
## chlorides -3.529121 1.509122 -2.339 0.01936 *
## free.sulfur.dioxide 0.022082 0.008184 2.698 0.00697 **
## total.sulfur.dioxide -0.015645 0.002811 -5.565 2.62e-08 ***
## sulphates 2.686254 0.432624 6.209 5.32e-10 ***
## alcohol 0.905412 0.073423 12.331 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 2209.0 on 1598 degrees of freedom
## Residual deviance: 1657.8 on 1590 degrees of freedom
## AIC: 1675.8
##
## Number of Fisher Scoring iterations: 4
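For a binary logistic model, the reported AIC can be recovered from the residual deviance: AIC = deviance + 2k, where k is the number of estimated parameters (here 8 coefficients plus the intercept). A quick check against the summary above:

```python
# AIC for a binary logistic GLM: residual deviance + 2 * (number of parameters).
residual_deviance = 1657.8  # from the glm summary above
n_params = 9                # 8 coefficients + intercept
aic = residual_deviance + 2 * n_params
# aic == 1675.8, matching the AIC reported by glm
```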
Alcohol is the most significant variable for predicting wine quality; it is the variable that brings the most information. As alcohol increases, the probability of a good quality wine increases (positive estimate). Conversely, as volatile acidity increases, the probability of a good quality wine decreases (negative estimate).
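One way to read these estimates is as odds ratios, the exponential of each coefficient; a sketch using a few estimates copied from the summary above:

```python
import math

# Estimates copied from the glm summary above.
coefs = {
    "volatile.acidity": -3.379881,
    "sulphates": 2.686254,
    "alcohol": 0.905412,
}

# Odds ratio: multiplicative change in the odds of qualiteY = 1
# for a one-unit increase in the variable, all else held fixed.
odds_ratios = {name: math.exp(b) for name, b in coefs.items()}
# e.g. one extra degree of alcohol multiplies the odds of a good wine by ~2.47
```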
If we need to predict wine quality on new data, we build a predictive model on a training dataset and evaluate it on a test dataset to estimate our error rate.
The sample contains enough data, so we can split it into two samples, one for training and one for testing.
I check the proportion of qualiteY in my two samples:
Train
Test
The proportions are equivalent.
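A stratified split is what keeps these proportions equivalent; a stdlib-only Python sketch (on hypothetical balanced labels, not the report's actual split):

```python
import random

def stratified_split(labels, test_frac=0.2, seed=42):
    """Split indices into train/test, preserving class proportions."""
    rng = random.Random(seed)
    train, test = [], []
    for cls in set(labels):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        cut = int(len(idx) * test_frac)
        test.extend(idx[:cut])
        train.extend(idx[cut:])
    return sorted(train), sorted(test)

# Hypothetical balanced labels, mimicking the 1599 wines.
labels = [0] * 800 + [1] * 799
train_idx, test_idx = stratified_split(labels)
p_train = sum(labels[i] for i in train_idx) / len(train_idx)
p_test = sum(labels[i] for i in test_idx) / len(test_idx)
# p_train and p_test are both close to 0.5
```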
The ROC curve (Receiver Operating Characteristic curve) plots the true positive rate on the y axis against the false positive rate on the x axis.
The AUC (Area Under the Curve) measures, on the test sample, how well the model separates the two classes. It lets us compare the ROC curves of multiple models.
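The AUC can also be read as the probability that a randomly chosen positive gets a higher score than a randomly chosen negative; a small pure-Python sketch on hypothetical scores:

```python
def auc(scores_pos, scores_neg):
    """AUC as the probability a positive outscores a negative (ties count 1/2)."""
    wins = 0.0
    for p in scores_pos:
        for n in scores_neg:
            if p > n:
                wins += 1
            elif p == n:
                wins += 0.5
    return wins / (len(scores_pos) * len(scores_neg))

# Hypothetical predicted probabilities on a test sample.
pos = [0.9, 0.8, 0.7, 0.45]  # true class 1
neg = [0.6, 0.3, 0.2, 0.1]   # true class 0
a = auc(pos, neg)  # 15 winning pairs out of 16 -> 0.9375
```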
AUC of the predictive model: 0.82
If I get new data, I can expect an error rate of 24.14% with this predictive model, using a threshold of 0.5.
We are now going to look for the threshold that minimises this error rate.
## threshold specificity sensitivity
## 1 0.5139985 0.7837838 0.748538
Let’s fix the threshold at 0.51 to create the confusion matrix.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0451 0.3105 0.5142 0.5377 0.8057 0.9859
At the threshold of 0.51, the error rate is 23.51%; this is the error rate I can expect on new data with this predictive model. When we receive new data, we will need to re-calibrate the model.
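The error rate at a given threshold comes straight from the confusion matrix; a sketch on hypothetical test labels and predicted probabilities:

```python
def confusion(y_true, probs, threshold):
    """2x2 confusion counts and error rate at a given probability threshold."""
    tp = fp = tn = fn = 0
    for y, p in zip(y_true, probs):
        pred = 1 if p >= threshold else 0
        if pred == 1 and y == 1:
            tp += 1
        elif pred == 1 and y == 0:
            fp += 1
        elif pred == 0 and y == 0:
            tn += 1
        else:
            fn += 1
    error_rate = (fp + fn) / len(y_true)
    return (tp, fp, tn, fn), error_rate

# Hypothetical test labels and predicted probabilities.
y = [1, 0, 1, 1, 0, 0, 1, 0]
p = [0.8, 0.6, 0.7, 0.4, 0.3, 0.2, 0.9, 0.55]
counts, err = confusion(y, p, threshold=0.51)
```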
A good residual is a residual without any apparent structure. Residuals need to be independent of the observations.
We will need to refit the model without observation 653.
In theory, 95% of the studentized residuals lie within the interval [-2, 2]. That is the case here, as only 30 residuals (2.34%) fall outside the interval.
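Counting the share of residuals outside [-2, 2] is straightforward; a sketch on hypothetical studentized residuals:

```python
def share_outside(residuals, lo=-2.0, hi=2.0):
    """Share of residuals falling outside [lo, hi]."""
    outside = sum(1 for r in residuals if r < lo or r > hi)
    return outside / len(residuals)

# Hypothetical studentized residuals: mostly in [-2, 2], a few outliers.
res = [0.1, -1.5, 0.8, 2.4, -0.3, 1.9, -2.7, 0.05, -0.9, 1.2]
share = share_outside(res)  # 2 of 10 outside -> 0.2
# Under the model we expect roughly 5% outside; the report finds 2.34%.
```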
I will compare other models to see whether they give better results than the logistic regression model.
The AUC with this model is 0.82.
The AUC with this model is 0.77.
The AUC with this model is 0.92.
Random forest is the model that gives the best results (highest AUC).
Confusion matrix of the random forest with a threshold of 0.5:
With a random forest, I can expect an error rate of 15.36% on new data.
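The report's random forest was fitted in R with randomForest; an equivalent sketch in Python with scikit-learn (on synthetic stand-in data, so the numbers here are illustrative only, not the report's 15.36%):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic stand-in for the wine data: 11 features, binary quality label.
X = rng.normal(size=(1599, 11))
y = (X[:, 10] + 0.5 * X[:, 1] + rng.normal(scale=0.5, size=1599) > 0).astype(int)

# Stratified split, then fit a forest and measure the test error rate.
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)
rf = RandomForestClassifier(n_estimators=200, random_state=0)
rf.fit(X_tr, y_tr)
error_rate = 1 - rf.score(X_te, y_te)  # share of misclassified test wines
```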